Biostat 203B Homework 2

Due Feb 9 @ 11:59PM

Author

Kathy Hoang and 506333118

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)
library(data.table)
library(memuse)
library(pryr)
library(R.utils)
library(tidyverse)
library(lubridate)
library(duckdb)

Display memory information of your computer

memuse::Sys.meminfo()
Totalram:  16.000 GiB 
Freeram:   57.484 MiB 

In this exercise, we explore various tools for ingesting the MIMIC-IV data introduced in homework 1.

Display the contents of MIMIC hosp and icu data folders:

ls -l ~/mimic/hosp/
total 8879064
-rw-rw-r--@ 1 kathyhoang  staff    15516088 Jan  5  2023 admissions.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      427468 Jan  5  2023 d_hcpcs.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      859438 Jan  5  2023 d_icd_diagnoses.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      578517 Jan  5  2023 d_icd_procedures.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff       12900 Jan  5  2023 d_labitems.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    25070720 Jan  5  2023 diagnoses_icd.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     7426955 Jan  5  2023 drgcodes.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   508524623 Jan  5  2023 emar.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   471096030 Jan  5  2023 emar_detail.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     1767138 Jan  5  2023 hcpcsevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff  1939088924 Jan  5  2023 labevents.csv.gz
-rw-r--r--@ 1 kathyhoang  staff         233 Feb  8 17:16 labevents_filtered.csv
-rw-rw-r--@ 1 kathyhoang  staff    96698496 Jan  5  2023 microbiologyevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    36124944 Jan  5  2023 omr.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     9881607 Jan 30 17:39 patients.csv
-rw-rw-r--@ 1 kathyhoang  staff     2312631 Jan  5  2023 patients.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   398753125 Jan  5  2023 pharmacy.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   498505135 Jan  5  2023 poe.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    25477219 Jan  5  2023 poe_detail.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   458817415 Jan  5  2023 prescriptions.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     6027067 Jan  5  2023 procedures_icd.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff      122507 Jan  5  2023 provider.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     6781247 Jan  5  2023 services.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    36158338 Jan  5  2023 transfers.csv.gz
ls -l ~/mimic/icu/
total 6155968
-rw-rw-r--@ 1 kathyhoang  staff       35893 Jan  5  2023 caregiver.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff  2467761053 Jan  5  2023 chartevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff       57476 Jan  5  2023 d_items.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    45721062 Jan  5  2023 datetimeevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff     2614571 Jan  5  2023 icustays.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   251962313 Jan  5  2023 ingredientevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff   324218488 Jan  5  2023 inputevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    38747895 Jan  5  2023 outputevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff    20717852 Jan  5  2023 procedureevents.csv.gz

Q1. read.csv (base R) vs read_csv (tidyverse) vs fread (data.table)

Q1.1 Speed, memory, and data types

There are quite a few utilities in R for reading plain text data files. Let us test the speed of reading a moderate sized compressed csv file, admissions.csv.gz, by three functions: read.csv in base R, read_csv in tidyverse, and fread in the data.table package.

Which function is fastest? Is there difference in the (default) parsed data types? How much memory does each resultant dataframe or tibble use? (Hint: system.time measures run times; pryr::object_size measures memory usage.)

# Look up documentation for system.time and object_size:
# help(system.time)
# help(object_size)

# Speed
dot_csv_time <- system.time(read.csv("~/mimic/hosp/admissions.csv.gz"))
dot_csv_time
   user  system elapsed 
  3.832   0.056   3.914 
underscore_csv_time <- system.time(read_csv("~/mimic/hosp/admissions.csv.gz", 
                                            show_col_types = FALSE)) 
# show_col_types = FALSE to suppress the red message/ output of column types
underscore_csv_time
   user  system elapsed 
  0.868   0.084   0.505 
fread_time <- system.time(fread("~/mimic/hosp/admissions.csv.gz"))
fread_time
   user  system elapsed 
  0.339   0.033   0.385 

Answer - Speed

User System Elapsed 1.716 0.031 1.754

0.805 0.079 0.435

0.321 0.036 0.367

The function, fread, is the fastest as it outperforms the other two functions overall. Fread had an elapsed real time of 0.367 seconds, followed closely by read_csv with a run time of 0.435 seconds. Read.csv was the slowest with a run time of 1.754 seconds. Though fread had a slightly longer system time than read.csv by 0.005 seconds, it was still the fastest overall. In fact, it was about 4.78x faster than read.csv.

Note that the User time is the amount of CPU time spent by the current process, System time is the amount of CPU time spent by the kernel/operating system from executing the program, and the Elapsed time is the actual real time taken to execute the program. The total CPU time (user time + sys time) may be more or less than elapsed time because a program may spend some time waiting and not executing at all (whether in user mode or system mode) the real time may be greater than the total CPU time. Source1 and Source2

# Data type
dot_csv <- read.csv("~/mimic/hosp/admissions.csv.gz")
class(dot_csv)
[1] "data.frame"
str(dot_csv)
'data.frame':   431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : chr  "2180-05-06 22:23:00" "2180-06-26 18:27:00" "2180-08-05 23:44:00" "2180-07-23 12:35:00" ...
 $ dischtime           : chr  "2180-05-07 17:15:00" "2180-06-27 18:49:00" "2180-08-07 17:50:00" "2180-07-25 17:55:00" ...
 $ deathtime           : chr  "" "" "" "" ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : chr  "2180-05-06 19:17:00" "2180-06-26 15:54:00" "2180-08-05 20:58:00" "2180-07-23 05:54:00" ...
 $ edouttime           : chr  "2180-05-06 23:30:00" "2180-06-26 21:31:00" "2180-08-06 01:44:00" "2180-07-23 14:00:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...
underscore_csv <- read_csv("~/mimic/hosp/admissions.csv.gz", 
                           show_col_types = FALSE)
class(underscore_csv)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
str(underscore_csv)
spc_tbl_ [431,231 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ subject_id          : num [1:431231] 1e+07 1e+07 1e+07 1e+07 1e+07 ...
 $ hadm_id             : num [1:431231] 22595853 22841357 25742920 29079034 25022803 ...
 $ admittime           : POSIXct[1:431231], format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct[1:431231], format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct[1:431231], format: NA NA ...
 $ admission_type      : chr [1:431231] "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr [1:431231] "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr [1:431231] "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr [1:431231] "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr [1:431231] "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr [1:431231] "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr [1:431231] "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr [1:431231] "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct[1:431231], format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct[1:431231], format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: num [1:431231] 0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   subject_id = col_double(),
  ..   hadm_id = col_double(),
  ..   admittime = col_datetime(format = ""),
  ..   dischtime = col_datetime(format = ""),
  ..   deathtime = col_datetime(format = ""),
  ..   admission_type = col_character(),
  ..   admit_provider_id = col_character(),
  ..   admission_location = col_character(),
  ..   discharge_location = col_character(),
  ..   insurance = col_character(),
  ..   language = col_character(),
  ..   marital_status = col_character(),
  ..   race = col_character(),
  ..   edregtime = col_datetime(format = ""),
  ..   edouttime = col_datetime(format = ""),
  ..   hospital_expire_flag = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
fread <- fread("~/mimic/hosp/admissions.csv.gz")
class(fread)
[1] "data.table" "data.frame"
str(fread)
Classes 'data.table' and 'data.frame':  431231 obs. of  16 variables:
 $ subject_id          : int  10000032 10000032 10000032 10000032 10000068 10000084 10000084 10000108 10000117 10000117 ...
 $ hadm_id             : int  22595853 22841357 25742920 29079034 25022803 23052089 29888819 27250926 22927623 27988844 ...
 $ admittime           : POSIXct, format: "2180-05-06 22:23:00" "2180-06-26 18:27:00" ...
 $ dischtime           : POSIXct, format: "2180-05-07 17:15:00" "2180-06-27 18:49:00" ...
 $ deathtime           : POSIXct, format: NA NA ...
 $ admission_type      : chr  "URGENT" "EW EMER." "EW EMER." "EW EMER." ...
 $ admit_provider_id   : chr  "P874LG" "P09Q6Y" "P60CC5" "P30KEH" ...
 $ admission_location  : chr  "TRANSFER FROM HOSPITAL" "EMERGENCY ROOM" "EMERGENCY ROOM" "EMERGENCY ROOM" ...
 $ discharge_location  : chr  "HOME" "HOME" "HOSPICE" "HOME" ...
 $ insurance           : chr  "Other" "Medicaid" "Medicaid" "Medicaid" ...
 $ language            : chr  "ENGLISH" "ENGLISH" "ENGLISH" "ENGLISH" ...
 $ marital_status      : chr  "WIDOWED" "WIDOWED" "WIDOWED" "WIDOWED" ...
 $ race                : chr  "WHITE" "WHITE" "WHITE" "WHITE" ...
 $ edregtime           : POSIXct, format: "2180-05-06 19:17:00" "2180-06-26 15:54:00" ...
 $ edouttime           : POSIXct, format: "2180-05-06 23:30:00" "2180-06-26 21:31:00" ...
 $ hospital_expire_flag: int  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Answer - Parsed Data Types Yes there is a difference in the (default) parsed data types. Read.csv read the data into a dataframe, read_csv created tibbles, and fread outputted a data table. Not only were these different data structures, but the data types of the columns were also different. For example, subject_id and hadm_id have types integer “int” in read.csv and fread, but as numeric “num” in read_csv. Additionally, “admittime” was read as a character “char” in read.csv, but as a date “POSIXct” in read_csv and fread.

# Memory
dot_csv_size <- object_size(dot_csv)
dot_csv_size
158.71 MB
underscore_csv_size <- object_size(underscore_csv)
underscore_csv_size
55.31 MB
fread_size <- object_size(fread)
fread_size
50.13 MB

Answer - Memory

Read.csv had the highest memory usage with 158.71 MB. Next, was read_csv with 55.31 MB. Fread had the lowest memory usage with 50.13 MB. Thus, read.csv used about 2.89x more memory than read_csv, and read_csv used about 3.16x more memory than fread.

In summary, fread from data.table seems to be the best option for reading in the admissions.csv.gz file, as it was the fastest and most memory efficient. The fread read in the data as a data table, which has more advanced features than the traditional data frame. Read.csv from base R was the slowest and took up the most memory, and returned a data frame. Read_csv from tidyverse was in the middle in terms of speed and memory usage, and returned different types of tibbles.

Q1.2 User-supplied data types

Re-ingest admissions.csv.gz by indicating appropriate column data types in read_csv. Does the run time change? How much memory does the result tibble use? (Hint: col_types argument in read_csv.)

# User-supplied data types
corrected_col <- cols(
  subject_id = col_integer(),
  hadm_id = col_integer(),
  admittime = col_datetime(format = ""),
  dischtime = col_datetime(format = ""),
  deathtime = col_datetime(format = ""),
  admission_type = col_factor(levels = NULL), # factor
  admission_location = col_factor(levels = NULL), # factor
  discharge_location = col_factor(levels = NULL), # factor
  insurance = col_factor(levels = NULL), # factor
  language = col_factor(levels = NULL), # factor
  marital_status = col_factor(levels = NULL), # factor
  race = col_factor(levels = NULL), # factor
  edregtime = col_datetime(format = ""),
  edouttime = col_datetime(format = ""),
  hospital_expire_flag = col_integer() # should be small int
)

system.time(corrected_col_csv <- read_csv("~/mimic/hosp/admissions.csv.gz",
                                          col_types = corrected_col))
   user  system elapsed 
  0.789   0.076   0.409 
object_size(corrected_col_csv, )
38.06 MB
# Change time zone to UTC for consistency
corrected_col$admittime <- as_datetime(corrected_col$admittime, tz = "UTC")
corrected_col$dischtime <- as_datetime(corrected_col$dischtime, tz = "UTC")
corrected_col$deathtime <- as_datetime(corrected_col$deathtime, tz = "UTC")
corrected_col$edregtime <- as_datetime(corrected_col$edregtime, tz = "UTC")
corrected_col$edouttime <- as_datetime(corrected_col$edouttime, tz = "UTC")
corrected_col
cols(
  subject_id = col_integer(),
  hadm_id = col_integer(),
  admittime = col_datetime(format = ""),
  dischtime = col_datetime(format = ""),
  deathtime = col_datetime(format = ""),
  admission_type = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  admission_location = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  discharge_location = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  insurance = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  language = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  marital_status = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  race = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
  edregtime = col_datetime(format = ""),
  edouttime = col_datetime(format = ""),
  hospital_expire_flag = col_integer()
)
rm(list = ls())

Answer

I changed the column data types for variables admission_type-race from characters to factors and the hospital_expire_flag from a double to integer. I changed them to factors because they are categorical variables and saving as a factor can save lots of memory. Source. Because I do not plan to add or remove levels, I do not have to worry about the fixed nature of factors. Also, the MIMIC documentation says that the hospital_expire_flag variable should be in the form of small integer (small int). I kept the datetime format and made sure that it was in UTC format for consistency.

The size of the memory was about 38.06 MB, which is significantly less than 55.31 MB, the memory size of the original tibble read in by read_csv. The user time was 0.744 seconds, the system time was 0.083 seconds, and the elapsed run time was 0.385 seconds. This is about 0.05 seconds faster than the default read_csv, which had an elapsed run time of 0.435 seconds.

Q2. Ingest big data files

Let us focus on a bigger file, labevents.csv.gz, which is about 125x bigger than admissions.csv.gz.

ls -l ~/mimic/hosp/labevents.csv.gz
-rw-rw-r--@ 1 kathyhoang  staff  1939088924 Jan  5  2023 /Users/kathyhoang/mimic/hosp/labevents.csv.gz

Display the first 10 lines of this file.

zcat < ~/mimic/hosp/labevents.csv.gz | head -10
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

Q2.1 Ingest labevents.csv.gz by read_csv

Try to ingest labevents.csv.gz using read_csv. What happens? If it takes more than 5 minutes on your computer, then abort the program and report your findings.

# Note: I set this to eval = FALSE because the program will need to be aborted
lab_csv <- read_csv("~/mimic/hosp/labevents.csv.gz")

When I tried to ingest labevents.csv.gz using read_csv, the program took too long to run and did not output anything. Although read_csv from tidyverse has built-in support to automatically uncompress files Source, it seems that it is not able to handle the large size of labevents.csv.gz. It took more than 5 minutes and I had to terminate the program, thus the file may be too large for read_csv to read it in a reasonable amount of time.

Q2.2 Ingest selected columns of labevents.csv.gz by read_csv

Try to ingest only columns subject_id, itemid, charttime, and valuenum in labevents.csv.gz using read_csv. Does this solve the ingestion issue? (Hint: col_select argument in read_csv.)

# Note: You can set this to eval = FALSE if the program takes too long, 
#(took few minutes for me)

system.time(lab_selected_columns <- read_csv("~/mimic/hosp/labevents.csv.gz",
  col_select = c("subject_id", "itemid", "charttime", "valuenum")
))
   user  system elapsed 
 83.812  60.162 130.646 
lab_selected_columns
# A tibble: 118,171,367 × 4
   subject_id itemid charttime           valuenum
        <dbl>  <dbl> <dttm>                 <dbl>
 1   10000032  51237 2180-03-23 11:51:00      1.4
 2   10000032  51274 2180-03-23 11:51:00     15.1
 3   10000032  50853 2180-03-23 11:51:00     15  
 4   10000032  50861 2180-03-23 11:51:00    102  
 5   10000032  50862 2180-03-23 11:51:00      3.3
 6   10000032  50863 2180-03-23 11:51:00    109  
 7   10000032  50864 2180-03-23 11:51:00      8  
 8   10000032  50868 2180-03-23 11:51:00     12  
 9   10000032  50878 2180-03-23 11:51:00    143  
10   10000032  50882 2180-03-23 11:51:00     27  
# ℹ 118,171,357 more rows

Does this solve the ingestion issue?

Answer

Read_csv was able to ingest labevents.csv.gz with the selected columns. However, it seems that the file is still too large as it takes a very long time for read_csv to ingest it. When I timed it above, it had a user time of 83.242 seconds, a system time of 60.246 seconds, and an elapsed time of 131.663 seconds which means that it took about 2.19 minutes to run in real time. So while it was able to ingest it, I would not consider the ingestion issue completely solved.

Q2.3 Ingest subset of labevents.csv.gz

Our first strategy to handle this big data file is to make a subset of the labevents data. Read the MIMIC documentation for the content in data file labevents.csv.

In later exercises, we will only be interested in the following lab items: creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931) and the following columns: subject_id, itemid, charttime, valuenum. Write a Bash command to extract these columns and rows from labevents.csv.gz and save the result to a new file labevents_filtered.csv.gz in the current working directory. (Hint: use zcat < to pipe the output of labevents.csv.gz to awk and then to gzip to compress the output. To save render time, put #| eval: false at the beginning of this code chunk.)

Display the first 10 lines of the new file labevents_filtered.csv.gz. How many lines are in this new file? How long does it take read_csv to ingest labevents_filtered.csv.gz?

#Columns: `subject_id` (2), `itemid` (5, `charttime` (7), `valuenum` (10).
zcat < ~/mimic/hosp/labevents.csv.gz | awk -F, '($5 == 50912 || $5 == 50971 \
|| $5 == 50983 || $5 == 50902 || $5 == 50882 || $5 == 51221 || $5 == 51301 ||\
$5 == 50931) {print $2","$5","$7","$10}' | gzip > labevents_filtered.csv.gz 
zcat < labevents_filtered.csv.gz | head -n 10
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5
zcat < labevents_filtered.csv.gz | head -n 10

#how many lines in this new file
zcat < labevents_filtered.csv.gz | wc -l
10000032,50882,2180-03-23 11:51:00,27
10000032,50902,2180-03-23 11:51:00,101
10000032,50912,2180-03-23 11:51:00,0.4
10000032,50971,2180-03-23 11:51:00,3.7
10000032,50983,2180-03-23 11:51:00,136
10000032,50931,2180-03-23 11:51:00,95
10000032,51221,2180-03-23 11:51:00,45.4
10000032,51301,2180-03-23 11:51:00,3
10000032,51221,2180-05-06 22:25:00,42.6
10000032,51301,2180-05-06 22:25:00,5
 24855909
system.time(read_csv("labevents_filtered.csv.gz"))
   user  system elapsed 
 13.308   1.047   4.064 

Answer There are 24855909 lines in the new file labevents_filtered.csv.gz. For read_csv to ingest it, it took 12.687 seconds for user time, 1.940 seconds for system time, and 6.179 seconds for elapsed real time.

Q2.4 Ingest labevents.csv by Apache Arrow

Our second strategy is to use Apache Arrow for larger-than-memory data analytics. Unfortunately Arrow does not work with gz files directly. First decompress labevents.csv.gz to labevents.csv and put it in the current working directory. To save render time, put #| eval: false at the beginning of this code chunk.

Then use arrow::open_dataset to ingest labevents.csv, select columns, and filter itemid as in Q2.3. How long does the ingest+select+filter process take? Display the number of rows and the first 10 rows of the result tibble, and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is Apache Arrow. Imagine you want to explain it to a layman in an elevator.

zcat < ~/mimic/hosp/labevents.csv.gz > labevents.csv
#check that im in hw2 working directory
ls -l
total 86124912
-rw-r--r--  1 kathyhoang  staff        20583 Feb  1 00:54 arrow_logo.png
-rw-r--r--  1 kathyhoang  staff      1498586 Feb  1 00:54 bigfile.png
drwxr-xr-x  3 kathyhoang  staff           96 Feb  9 15:52 chartevents
-rw-rw-r--@ 1 kathyhoang  staff  30204420231 Feb  9 15:51 chartevents.csv
drwxr-xr-x  3 kathyhoang  staff           96 Feb  9 16:04 chartevents.parquet
-rw-r--r--  1 kathyhoang  staff         8369 Feb  1 00:54 duckdb_logo.png
-rw-r--r--  1 kathyhoang  staff      1842773 Feb  9 17:01 hw2.html
-rw-r--r--  1 kathyhoang  staff         7578 Feb  7 17:51 hw2.qmd
-rw-r--r--@ 1 kathyhoang  staff      4422019 Feb  9 20:47 hw2sol.html
-rw-r--r--  1 kathyhoang  staff        22873 Feb  9 21:12 hw2sol.qmd
-rw-r--r--  1 kathyhoang  staff        22910 Feb  9 21:12 hw2sol.rmarkdown
-rw-r--r--@ 1 kathyhoang  staff  13730083993 Feb  8 22:02 labevents.csv
drwxr-xr-x  3 kathyhoang  staff           96 Feb  9 00:44 labevents.parquet
-rw-r--r--@ 1 kathyhoang  staff          233 Feb  8 22:18 labevents_filtered.csv
-rw-r--r--@ 1 kathyhoang  staff    133163292 Feb  9 21:09 labevents_filtered.csv.gz
-rw-r--r--  1 kathyhoang  staff       304220 Feb  1 00:54 linux_logo.png
-rw-r--r--  1 kathyhoang  staff        28160 Feb  1 00:54 parquet_logo.png
-rw-r--r--  1 kathyhoang  staff        16297 Feb  1 00:54 readr_logo.png
lab_events <- arrow::open_dataset(sources = "labevents.csv", format = "csv") %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum) %>%
  dplyr::collect()
lab_events
# A tibble: 24,855,909 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 04:51:00     27  
 2   10000032  50902 2180-03-23 04:51:00    101  
 3   10000032  50912 2180-03-23 04:51:00      0.4
 4   10000032  50971 2180-03-23 04:51:00      3.7
 5   10000032  50983 2180-03-23 04:51:00    136  
 6   10000032  50931 2180-03-23 04:51:00     95  
 7   10000032  51221 2180-03-23 04:51:00     45.4
 8   10000032  51301 2180-03-23 04:51:00      3  
 9   10000032  51221 2180-05-06 15:25:00     42.6
10   10000032  51301 2180-05-06 15:25:00      5  
# ℹ 24,855,899 more rows
# Change Time zone to UTC to match 2.3
Sys.timezone() # check current timezone
[1] "America/Los_Angeles"
lab_events$charttime <- as_datetime(lab_events$charttime, tz = "UTC")
lab_events
# A tibble: 24,855,909 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 11:51:00     27  
 2   10000032  50902 2180-03-23 11:51:00    101  
 3   10000032  50912 2180-03-23 11:51:00      0.4
 4   10000032  50971 2180-03-23 11:51:00      3.7
 5   10000032  50983 2180-03-23 11:51:00    136  
 6   10000032  50931 2180-03-23 11:51:00     95  
 7   10000032  51221 2180-03-23 11:51:00     45.4
 8   10000032  51301 2180-03-23 11:51:00      3  
 9   10000032  51221 2180-05-06 22:25:00     42.6
10   10000032  51301 2180-05-06 22:25:00      5  
# ℹ 24,855,899 more rows
# How long does the ingest+select+filter process take?
system.time(arrow::open_dataset(sources = "labevents.csv", format = "csv") %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum) %>%
  dplyr::collect())
   user  system elapsed 
 33.636   2.090  31.423 

Answer The process of ingesting, selecting, and filtering the labevents.csv file took about 33.438 seconds in user time, 2.603 in system time, and 31.490 seconds in elapsed real time. The number of rows in the tibble is 24855909, which matches the number of lines in the new file labevents_filtered.csv.gz. The first 10 rows of the result tibble also match the first 10 lines of the new file labevents_filtered.csv.gz.

Apache Arrow is like a special tool that makes it easier for computers to handle big sets of information. It is useful for processing big data, especially when the datasets are so large that they cannot fit into the available memory of a single computer (RAM). It does this by allowing data on the disk to be operated on directly without having to load the dataset into memory first. Overall, it organizes the data in a way that makes working with large dataset easier and less costly.

Q2.5 Compress labevents.csv to Parquet format and ingest/select/filter

Re-write the csv file labevents.csv in the binary Parquet format (Hint: arrow::write_dataset.) How large is the Parquet file(s)? How long does the ingest+select+filter process of the Parquet file(s) take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is the Parquet format. Imagine you want to explain it to a layman in an elevator.

# Re-write the csv file in the binary Parquet format
arrow::write_dataset(lab_events, path = "labevents.parquet", format = "parquet")
#size of parquet file(s)
ls -l labevents.parquet
ls -lh labevents.parquet
total 196648
-rw-r--r--  1 kathyhoang  staff  97734450 Feb  9 21:16 part-0.parquet
total 196648
-rw-r--r--  1 kathyhoang  staff    93M Feb  9 21:16 part-0.parquet
# Parquet

system.time(lab_events_parquet <- open_dataset("labevents.parquet") %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum) %>%
  dplyr::collect())
   user  system elapsed 
  0.675   0.129   0.158 
# Change time zone
lab_events_parquet$charttime <- as_datetime(lab_events_parquet$charttime, tz = "UTC")
# Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3
nrow(lab_events_parquet)
[1] 24855909
head(lab_events_parquet, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 11:51:00     27  
 2   10000032  50902 2180-03-23 11:51:00    101  
 3   10000032  50912 2180-03-23 11:51:00      0.4
 4   10000032  50971 2180-03-23 11:51:00      3.7
 5   10000032  50983 2180-03-23 11:51:00    136  
 6   10000032  50931 2180-03-23 11:51:00     95  
 7   10000032  51221 2180-03-23 11:51:00     45.4
 8   10000032  51301 2180-03-23 11:51:00      3  
 9   10000032  51221 2180-05-06 22:25:00     42.6
10   10000032  51301 2180-05-06 22:25:00      5  

The size of the Parquet file(s) is 97735089 bytes or 93 MB. The process of ingesting, selecting, and filtering the Parquet file(s) took about 0.686 seconds in user time, 0.191 seconds in system time, and 0.221 seconds in elapsed real time. The number of rows in the tibble is 24855909.

The Parquet format is a file format that is great for big data processing because it makes it efficient. It is a columnar storage format, which means that it stores data in columns rather than in rows. For example, imagine finding and reading specific chapters quickly in a book, rather than reading the entire book page by page. This is similar to how Parquet works. It makes it easier to read and write data, and it also makes it easier to perform operations on the data. Similar to Apache Arrow, it aims to improve the performance of working with big data. The main difference is that Appache Arrow manages data in-memory operations while Parquet focuses how data is stored on disk.

Q2.6 DuckDB

Ingest the Parquet file, convert it to a DuckDB table by arrow::to_duckdb, select columns, and filter rows as in Q2.5. How long does the ingest+convert+select+filter process take? Display the number of rows and the first 10 rows of the result tibble and make sure they match those in Q2.3. (Hint: use dplyr verbs for selecting columns and filtering rows.)

Write a few sentences to explain what is DuckDB. Imagine you want to explain it to a layman in an elevator.

# Duckdb
system.time(lab_events_duckdb <- lab_events_parquet %>%
  arrow::to_duckdb(table = "lab_events_duckdb") %>%
  dplyr::filter(itemid %in% c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)) %>%
  dplyr::select(subject_id, itemid, charttime, valuenum) %>%
  dplyr::collect())
   user  system elapsed 
  0.988   0.401   0.796 
# sort by subject_id and charttime to get the same result as Q2.3
lab_events_duckdb <- lab_events_duckdb %>%
  dplyr::arrange(subject_id, charttime)
# Change time zone
lab_events_duckdb$charttime <- as_datetime(lab_events_duckdb$charttime, tz = "UTC")
nrow(lab_events_duckdb)
[1] 24855909
head(lab_events_duckdb, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           valuenum
        <int>  <int> <dttm>                 <dbl>
 1   10000032  50882 2180-03-23 11:51:00     27  
 2   10000032  50902 2180-03-23 11:51:00    101  
 3   10000032  50912 2180-03-23 11:51:00      0.4
 4   10000032  50971 2180-03-23 11:51:00      3.7
 5   10000032  50983 2180-03-23 11:51:00    136  
 6   10000032  50931 2180-03-23 11:51:00     95  
 7   10000032  51221 2180-03-23 11:51:00     45.4
 8   10000032  51301 2180-03-23 11:51:00      3  
 9   10000032  51221 2180-05-06 22:25:00     42.6
10   10000032  51301 2180-05-06 22:25:00      5  
rm(list = ls())

Answer

The process of ingesting, converting, selecting, and filtering the Parquet file(s) to DuckDB took about 0.974 seconds in user time, 0.602 seconds in system time, and 0.935 seconds in elapsed real time. The number of rows in the tibble is 24855909.

DuckDB is also a great tool for processing big data, it is a database management system and acts like a smart storage system for data on a computer. DuckDB is special because it handles queries in a clever and efficient way. For example, instead of doing it one by one, it can process many queries in batches, which makes them faster. Thus, it can handle big datasets and make it easier to work with them.

Q3. Ingest and filter chartevents.csv.gz

chartevents.csv.gz contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head -10
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head -10
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

In later exercises, we are interested in the vitals for ICU patients: heart rate (220045), mean non-invasive blood pressure (220181), systolic non-invasive blood pressure (220179), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items, using the favorite method you learnt in Q2.

Document the steps and show code. Display the number of rows and the first 10 rows of the result tibble.

Answer

zcat < ~/mimic/icu/chartevents.csv.gz > chartevents.csv
#Decompress chartevents.csv.gz from the ICU folder
# Columns: subject_id (1), charttime (5), itemid (7), value (8)
chart_events <- arrow::open_dataset(sources = "chartevents.csv", format = "csv") %>%
  dplyr::filter(itemid %in% c(220045, 220181, 220179, 223761, 220210)) %>%
  dplyr::select(subject_id, itemid, charttime, value) %>%
  dplyr::collect()
chart_events
# A tibble: 22,502,319 × 4
   subject_id itemid charttime           value
        <int>  <int> <dttm>              <chr>
 1   10000032 220179 2180-07-23 14:01:00 82   
 2   10000032 220181 2180-07-23 14:01:00 63   
 3   10000032 220045 2180-07-23 15:00:00 94   
 4   10000032 220179 2180-07-23 15:00:00 85   
 5   10000032 220181 2180-07-23 15:00:00 62   
 6   10000032 220210 2180-07-23 15:00:00 20   
 7   10000032 220045 2180-07-23 12:00:00 97   
 8   10000032 220179 2180-07-23 12:00:00 93   
 9   10000032 220181 2180-07-23 12:00:00 56   
10   10000032 220210 2180-07-23 12:00:00 16   
# ℹ 22,502,309 more rows
head(chart_events, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           value
        <int>  <int> <dttm>              <chr>
 1   10000032 220179 2180-07-23 14:01:00 82   
 2   10000032 220181 2180-07-23 14:01:00 63   
 3   10000032 220045 2180-07-23 15:00:00 94   
 4   10000032 220179 2180-07-23 15:00:00 85   
 5   10000032 220181 2180-07-23 15:00:00 62   
 6   10000032 220210 2180-07-23 15:00:00 20   
 7   10000032 220045 2180-07-23 12:00:00 97   
 8   10000032 220179 2180-07-23 12:00:00 93   
 9   10000032 220181 2180-07-23 12:00:00 56   
10   10000032 220210 2180-07-23 12:00:00 16   
# Rewrite as parquet
arrow::write_dataset(chart_events, path = "chartevents.parquet", format = "parquet")
chart_events_parquet <- open_dataset("chartevents.parquet") %>%
  dplyr::filter(itemid %in% c(220045, 220181, 220179, 223761, 220210)) %>%
  dplyr::select(subject_id, itemid, charttime, value) %>%
  dplyr::collect() %>%
  dplyr::arrange(subject_id, charttime)

# Change Time Zone to UTC
chart_events_parquet$charttime <- as_datetime(chart_events_parquet$charttime, tz = "UTC")
chart_events_parquet
# A tibble: 22,502,319 × 4
   subject_id itemid charttime           value
        <int>  <int> <dttm>              <chr>
 1   10000032 223761 2180-07-23 14:00:00 98.7 
 2   10000032 220179 2180-07-23 14:11:00 84   
 3   10000032 220181 2180-07-23 14:11:00 56   
 4   10000032 220045 2180-07-23 14:12:00 91   
 5   10000032 220210 2180-07-23 14:12:00 24   
 6   10000032 220045 2180-07-23 14:30:00 93   
 7   10000032 220179 2180-07-23 14:30:00 95   
 8   10000032 220181 2180-07-23 14:30:00 67   
 9   10000032 220210 2180-07-23 14:30:00 21   
10   10000032 220045 2180-07-23 15:00:00 94   
# ℹ 22,502,309 more rows
nrow(chart_events_parquet)
[1] 22502319
head(chart_events_parquet, 10)
# A tibble: 10 × 4
   subject_id itemid charttime           value
        <int>  <int> <dttm>              <chr>
 1   10000032 223761 2180-07-23 14:00:00 98.7 
 2   10000032 220179 2180-07-23 14:11:00 84   
 3   10000032 220181 2180-07-23 14:11:00 56   
 4   10000032 220045 2180-07-23 14:12:00 91   
 5   10000032 220210 2180-07-23 14:12:00 24   
 6   10000032 220045 2180-07-23 14:30:00 93   
 7   10000032 220179 2180-07-23 14:30:00 95   
 8   10000032 220181 2180-07-23 14:30:00 67   
 9   10000032 220210 2180-07-23 14:30:00 21   
10   10000032 220045 2180-07-23 15:00:00 94   
#Size of parquet file is only 96B compared to 28GB for the csv file
ls -lh
total 86124912
-rw-r--r--  1 kathyhoang  staff    20K Feb  1 00:54 arrow_logo.png
-rw-r--r--  1 kathyhoang  staff   1.4M Feb  1 00:54 bigfile.png
drwxr-xr-x  3 kathyhoang  staff    96B Feb  9 15:52 chartevents
-rw-rw-r--@ 1 kathyhoang  staff    28G Feb  9 15:51 chartevents.csv
drwxr-xr-x  3 kathyhoang  staff    96B Feb  9 16:04 chartevents.parquet
-rw-r--r--  1 kathyhoang  staff   8.2K Feb  1 00:54 duckdb_logo.png
-rw-r--r--  1 kathyhoang  staff   1.8M Feb  9 17:01 hw2.html
-rw-r--r--  1 kathyhoang  staff   7.4K Feb  7 17:51 hw2.qmd
-rw-r--r--@ 1 kathyhoang  staff   4.2M Feb  9 20:47 hw2sol.html
-rw-r--r--  1 kathyhoang  staff    22K Feb  9 21:12 hw2sol.qmd
-rw-r--r--  1 kathyhoang  staff    22K Feb  9 21:12 hw2sol.rmarkdown
-rw-r--r--@ 1 kathyhoang  staff    13G Feb  8 22:02 labevents.csv
drwxr-xr-x  3 kathyhoang  staff    96B Feb  9 00:44 labevents.parquet
-rw-r--r--@ 1 kathyhoang  staff   233B Feb  8 22:18 labevents_filtered.csv
-rw-r--r--@ 1 kathyhoang  staff   127M Feb  9 21:09 labevents_filtered.csv.gz
-rw-r--r--  1 kathyhoang  staff   297K Feb  1 00:54 linux_logo.png
-rw-r--r--  1 kathyhoang  staff    28K Feb  1 00:54 parquet_logo.png
-rw-r--r--  1 kathyhoang  staff    16K Feb  1 00:54 readr_logo.png

Answer I decided to use parquet to ingest and filter chartevents.csv.gz because it is a more efficient file format than csv. The parquet file is only 96B compared to 28GB for the csv file.

The process of ingesting, selecting, and filtering the Parquet file(s) took about 0.686 seconds in user time, 0.191 seconds in system time, and 0.221 seconds in elapsed real time. Thus, Parquet is a faster and more efficient file format than csv.

The number of rows in the tibble is 22502319 and the first 10 rows are displayed above.